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Abstract.  Current  router  models  [2,  3,  5,  6]  assume  that 
clock  cycle  time  depends  solely  on  router  latency.  How- 
ever,  in  practice,  routers  are  heavily  pipelined,  making 
cycle  time  largely  independent  of  router  latency.  In  this 
paper,  we  describe  a  router  delay  model  that  accurately 
accounts  for  pipelining  based  on  technology-independent 
delay  estimates  derived  through  detailed  gate-level  analy¬ 
sis.  Simulations  of  realistic  router  pipelines  show  signifi¬ 
cant  performance  differences  compared  with  the 
commonly-assumed  unit-latency  model.  Using  realistic 
pipeline  models,  we  compared  wormhole  and  virtual-chan¬ 
nel  flow  control  Our  results  show  that  virtual  channels 
incur  a  modest  additional  cycle  of  per-hop  router  latency 
which  is  more  than  offset  by  the  25-40%  throughput 
improvement  over  a  wormhole  router. 

1.  Introduction 

Most  current  literature  in  interconnection  networks 
reports  comparisons  of  different  flow  control  and  routing 
techniques  without  considering  implementation  complex¬ 
ity  and  the  impact  on  router  delay,  simply  assuming  unit 
router  delay.  This  can  lead  to  inaccurate  and  skewed  com¬ 
parisons.  A  router  delay  model  which  enables  designers 
and  researchers  to  factor  in  implementation-specific  delay 
estimates  will  thus  be  invaluable. 

Chien  [2,  3]  proposed  a  router  model  for  wormhole  and 
virtual-channel  routers^  to  address  this  need.  In  his  model, 
he  presented  a  canonical  router  architecture  as  depicted  in 
Figure  1,  which  can  be  applied  to  all  routers,  regardless  of 
the  flow  control  or  routing  technique  governing  the  router. 
The  canonical  router  architecture  consists  of  the  following 
functions:-  address  decoding  (AD),  flow  control  (FC), 
routing  header  selection  (SEL),  crossbar  arbitration  (ARB), 
crossbar  traversal  (CB)  and  virtual  channel  controllers 
(VC),  and  the  model  defines  per-hop  router  latency  as  the 
total  delay  of  the  functions  on  the  critical  path,  i.e.  router 


1.  Miller  and  Najjar  extended  Chien^  model  for  virtual  cut-through 
routers,  modifying  the  parameterized  delay  equation  for  Tpc  to 
include  the  parameter  B,  the  number  of  buffers  in  that  input  queue  [6]. 
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Figure  1.  Canonical  router  architecture  proposed  in 
Chien§  model.  The  parameters  of  the  delay  model  are  P, 
the  number  of  ports  on  the  crossbar;  F,  the  number  of 
output  route  choices;  and  V,  the  number  of  virtual  channels 
per  physical  channel. 

latency  =  T^d  +  Tsel  +  ^arb  +  Tcb  +  Tyc-  Through 
detailed  gate-level  design  and  analysis,  the  delay  of  these 
functions  are  expressed  in  parameterized  equations  which 
are  then  grounded  in  a  0.8  micron  CMOS  process.  By  sub¬ 
stituting  the  parameters  of  a  router  into  the  parametric 
equations,  a  designer  can  obtain  estimates  of  router  latency 
easily,  and  use  that  as  a  basis  of  comparison. 

However,  a  major  oversight  in  Chien^  model  is  the 
omission  of  pipelining,  which  is  present  in  most  practical 
routers.  Duato  [5]  hence  proposed  an  extension  of  the 
model  to  pipelined  routers.  His  model  groups  the  functions 
in  the  original  Chienh  model  into  3  pipeline  stages  :-  T  r, 
the  routing  stage  which  encompasses  Tad» 

TarbJ  Ts,  the  switching  stage  which  includes  Tpc,  T^b 
and  the  delay  incurred  in  latching  a  flit  at  the  output  port; 
To  the  channel  stage  which  includes  Tvc  and  the  inter¬ 
node  delay.  The  clock  cycle  is  then  prescribed  as  the  maxi¬ 
mum  delay  of  these  3  stages  and  per-hop  router  latency  is 
thrice  the  clock  cycle  time.  Both  these  models  are  based 
upon  the  premise  that  clock  cycle  time  depends  solely  on 
router  latency  though,  which  is  usually  not  the  case  in  prac¬ 
tical  router  design.  Typically,  router  designers  have  to 
work  within  the  limits  of  a  clock  cycle  which  is  determined 
by  factors  beyond  the  router,  such  as  the  fundamental  lim¬ 
its  of  chip-to-chip  signalling  [1],  or  the  processor  clock 
cycle.  Hence,  setting  hard  pipeline  boundaries  and  fitting 
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the  clock  cycle  around  them  will  not  work.  A  delay  model 
has  to  advocate  a  good  pipeline  design  for  a  router,  given  a 
particular  clock  cycle  time. 

Besides,  these  models  abstract  routers  to  a  single  canon¬ 
ical  architecture,  which  result  in  a  micro-architecture 
which  is  not  optimal  for  certain  types  of  routers.  For 
instance,  in  the  proposed  canonical  router  architecture,  pas¬ 
sage  through  the  crossbar  is  arbitrated  on  a  per-packet  basis 
and  held  throughout  the  duration  of  a  packet,  thus  prompt¬ 
ing  the  need  for  a  huge  crossbar  in  a  virtual-channel  router, 
with  the  number  of  ports  equal  to  the  total  number  of  vir¬ 
tual  channels.  This  contributes  unnecessary  delay  in  the 
crossbar  arbitration  and  traversal  functions.  Buffering  flits 
at  virtual  channel  controllers  whose  arbitration  delay 
increases  with  the  number  of  virtual  channels  also  adds 
needless  cost  to  a  virtual-channel  router.  Thus,  there  is  a 
need  for  canonical  router  architectures  which  are  tailored 
optimally  for  different  flow  control  techniques. 

2.  Proposed  model 

In  this  paper,  we  propose  a  delay  model  which  attempts 
to  address  the  issues  raised.  It  comprises  a  general  router 
model  which  outlines  a  design  methodology  for  the  pipe¬ 
lining  of  a  router  given  a  clock  cycle  time,  using  the  delay 
estimates  derived  by  a  specific  router  model  to  prescribe  a 
pipeline  design. 

2.1  Canonical  router  architectures 
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Figure  2.  Canonical  wormhole  router  architecture. 
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First,  the  model  proposes  canonical  router  architectures 
which  are  tailored  to  each  flow  control  technique.  Figure  2 
illustrates  the  canonical  wormhole  router  architecture^, 
where  head  flits  enter  the  input  controller  at  each  port,  pro¬ 
ceed  through  routing y  before  sending  their  requests  for 
desired  output  ports  to  the  switch  arbiter.  A  state  machine 
at  each  input  channel  (inpcjstate)  dictates  the  state  in 
which  each  packet  is  in:  routing,  switch  arbitration  or 
crossbar  traversal.  Upon  receiving  requests  from  all  input 
controllers,  the  global  switch  arbiter  consults  the  status  of 
output  ports  as  stored  in  the  state  machine  for  each  output 
port  (outpc_state),  resolves  conflicts  and  grants  free  ports 
to  requestors.  Passage  through  the  crossbar  is  then  held  for 
the  entire  duration  of  the  packet,  till  the  tail  flit  leaves  and 
releases  this  hold.  As  shown  in  the  figure,  the  parameters 
affecting  the  delay  of  the  various  modules  of  a  wormhole 


2.  This  is  similar  to  the  canonical  router  architecture  proposed  in  Chien^ 
model. 

3.  Throughout  the  paper,  our  emphasis  is  on  a  comparison  of  flow  con¬ 
trol  techniques.  Hence,  we  will  be  viewing  routing  as  a  black  box,  and 
assuming  decoding  and  routing  takes  a  typical  clock  cycle  of  2OT4. 


Figure  3.  Canonical  virtual-channel  router  architecture. 

router  are  p,  the  number  of  ports  on  the  crossbar,  and  w,  the 
channel  width  or  phit  size. 

Figure  3  shows  the  canonical  router  architecture  pro¬ 
posed  for  virtual-channel  flow  control.  Here,  after  selecting 
an  output  port  through  the  routing  logic,  a  head  flit  submits 
requests  for  desired  virtual  channels  to  the  global  virtual 
channel  allocator,  which  consults  the  status  of  virtual 
channels  in  outvcjstate  and  grants  free  virtual  channels  to 
requestors.  Flits  of  a  packet  that  has  secured  an  output  vir¬ 
tual  channel  then  arbitrate  for  passage  through  the  crossbar 
switch  each  cycle.  Instead  of  reserving  passage  through  the 
crossbar  for  the  entire  duration  of  a  packet,  the  crossbar  is 
apportioned  to  flits  of  different  packets  on  a  cycle-by-cycle 
basis.  In  this  router  architecture,  the  function  of  multiplex¬ 
ing  virtual  channels  onto  a  physical  channel  rests  upon  the 
switch  allocator,  instead  of  virtual  channel  controllers.  The 
additional  parameter  for  virtual-channel  flow  control  is  v, 
the  number  of  virtual  channels  per  physical  channel,  which 
affects  the  delay  of  the  virtual  channel  allocator. 
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Figure  4.  Atomic  phases  and  dependences  of  (a)  a 
wormhole  router;  (b)  a  virtual-channel  router;  (c)  a 
speculative  virtual-channel  router. 


Figure  5.  Latency  and  cycle  time  estimates  derived  by  the 
specific  router  model. 


virtual-channel  routers,  which  can  potentially  reduce  vir¬ 
tual-channel  router  latency  to  that  of  a  wormhole  router. 

2.3  Pipeline  design 


2.2  Atomic  modules  and  dependencies 

The  canonical  router  architectures  prescribe  the  organi¬ 
zation  of  the  different  functions  of  a  router.  Realization  of 
these  flmctions  in  hardware  will  uncover  modules  which 
are  not  amenable  to  pipelining.  These  modules  are  termed 
"'‘atomic  modules"'  in  our  model,  and  are  best  kept  intact 
within  a  single  pipeline  stage.  An  example  of  an  atomic 
module  is  the  virtual-channel  allocator  in  a  virtual-channel 
router.  If  this  module  straddles  multiple  pipeline  stages,  it 
can  result  in  grants  not  being  reflected  correctly  before  the 
next  allocation.  Besides,  with  a  separable  allocator,  there 
are  lots  of  wires  connecting  the  input  and  output  ports, 
which  will  require  excessive  latching  should  the  allocator 
be  binned  in  multiple  pipeline  stages.  Figure  4  shows  the 
various  atomic  modules  of  wormhole  and  virtual  channel 
routers  in  our  model. 

The  inputs  of  these  atomic  modules  may  depend  on  the 
outputs  of  another,  in  which  case,  a  dependency  exists. 
These  dependencies  determine  the  critical  path  of  a  router. 
Figure  4(a)  and  (b)  shows  the  basic  dependencies  of  a 
wormhole  and  virtual-channel  router  respectively. 

Sometimes,  these  dependencies  can  be  averted  with 
speculation.  For  instance,  the  switch  allocator  in  a  virtual- 
channel  router  can  speculatively  assume  that  the  packet 
will  manage  to  obtain  a  free  output  virtual  channel  from  the 
virtual  channel  allocator,  and  thus,  proceed  to  request  for 
the  desired  output  port  before  it  has  secured  an  output  vir¬ 
tual  channel.  Should  the  speculation  be  incorrect,  the 
crossbar  passage  reserved  will  just  be  wasted.  With  specu¬ 
lation,  a  virtual-channel  router  removes  the  dependency 
between  the  virtual  channel  allocation  and  switch  alloca¬ 
tion  phases,  and  cuts  down  on  its  critical  path  delay,  as 
shown  in  Figure  4(c).  However,  the  speculation  may  result 
in  lower  throughput  as  the  crossbar  switch  may  be  allo¬ 
cated  to  packets  which  are  unable  to  use  it  due  to  unavail¬ 
able  virtual  channels.  While  this  paper  reports  comparisons 
of  just  basic  wormhole  and  virtual-channel  routers,  work  is 
currently  under-going  to  apply  our  model  to  speculative 


The  delay  of  each  atomic  module  is  then  modelled  by 
the  parametric  equations  derived  by  the  specific  router 
model  (Section  2.4  on  page  3)  which  generates  two  delay 
estimates:  latency  {t^  and  cycle  (c,)  delay.  Latency  delay 
spans  from  when  inputs  are  presented  to  the  module,  to 
when  the  outputs  which  are  needed  by  the  next  module  are 
stable.  Cycle  delay  refers  to  the  delay  expended  by  addi¬ 
tional  circuitry  required  before  the  next  set  of  inputs  can  be 
presented  to  the  module.  Figure  5  shows  these  delay  com¬ 
ponents.  In  a  switch  arbiter,  for  example,  latency  spans 
from  when  requests  for  crossbar  ports  are  presented  to 
when  grant  signals  are  stable,  while  cycle  delay  refers  to 
the  delay  in  the  circuits  for  updating  the  status  of  output 
ports,  so  that  subsequent  arbitrations  do  not  grant  output 
ports  which  have  already  been  reserved. 

Armed  with  t^  and  c,-  of  each  atomic  phase  on  the  critical 
path,  the  model  prescribes  the  pipelining  of  the  router  as 
follows 

Given  a  is  the  first  atomic  module  in  a  pipeline  stage  and 
b  is  the  last  atomic  module  ^ 

b 

^  tf  +  C*  ^  elk  and 
/=«! 

6+1  6 

+  Ct,|  >  elk  and  ^ 

i=a  i=a—] 

(EQl) 

2.4  Specific  router  model 

The  specific  router  model  is  responsible  for  the  deriva¬ 
tion  of  the  parameterized  delay  equations  for  atomic  mod¬ 
ules.  We  ground  our  specific  delay  model  on  the  theory  of 
logical  effort  [7,  8]  which  was  proposed  by  Sutherland, 
Sproull  and  Harris  for  estimating  circuit  delay  and  guiding 
the  design  of  circuits  with  minimum  delay. 

The  theory  of  logical  effort  prescribes  that  circuit  delay 
along  a  path  is  the  sum  of  the  delay  effort  and  the  parasitic 
delay  of  that  path,  and  delay  effort  of  each  stage  is  the 
product  of  logical  effort  and  electrical  effort  (EQ  2).  Logi¬ 
cal  effort  refers  to  the  number  of  times  worse  the  gate  is  at 


TABLE  1.  Parameterized  delay  equations  (in  T)  for 
wormhole  and  virtual-channel  routers. 
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delivering  output  current  than  would  be  an  inverter  with 
identical  input  capacitance,  while  electrical  effort  refers  to 
the  ratio  of  output  capacitance  to  input  capacitance,  or 
fanout. 

DelayQn  {Delay  Effort Parasitic  Delay  {in  T)}+5 

where  ~  logical  effort  per  stage; 

A,  =  electrical  effort  per  stage; 
p^  =  parasitic  delay; 

Through  detailed  gate-level  design  and  analysis  of  the 
circuits  of  each  atomic  module^  and  the  application  of 
equation  2,  technology-independent  parameterized  delay 
equations  for  each  atomic  module  are  derived  and  listed  in 
Table  1.  Projections  of  the  model  are  validated  against 
Synopsys  timing  analyzer  in  a  0.18  micron  technology  and 
found  to  be  close. 
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Figure  6.  Effect  of  p,  the  number  of  physical  channels 
(pcs)  and  v,  the  number  of  virtual  channels  (ves)  on  per- 
node  latency  of  virtual-channel  routers.  A  typical  clock  cycle 
of  20  T4  was  assumed.  Each  bar  illustrates  the  router 
pipeline,  with  the  shaded  regions  corresponding  to  the 
fraction  of  clock  cycle  time  used  by  each  atomic  module, 

3.  Insights  from  model 

Figure  6  details  the  effect  of  differing  numbers  of  phys¬ 
ical  and  virtual  channels  on  the  per-node  latency  of  a  vir¬ 
tual-channel  router,  with  a  typical  clock  cycle  of  20 14^. 
For  a  2-dimensional  virtual-channel  router  with  5  physical 
channels,  4  pipeline  stages  are  sufficient  for  up  to  16  vir¬ 
tual  channels  per  physical  channel.  Thus,  overall  per-node 
latency  does  not  increase  as  the  number  of  virtual  channels 
increases  from  2  to  16.  With  a  3-dimensional  network,  a 
virtual-channel  router  with  7  physical  channels  will  incur 
the  same  per-node  latency  of  4  cycles  (due  to  its  4  pipeline 
stages)  for  up  to  16  virtual  channels  per  physical  channel 
too. 

These  model  projections  indicate  that  a  virtual-channel 
router  typically  requires  just  one  more  pipeline  stage  than  a 
wormhole  router^.  Also,  with  most  practical  numbers  of 
virtual  channels  used  to  date,  the  delay  of  a  virtual-channel 
router  remains  unchanged. 

4.  Simulation  results 

Based  on  the  pipelined  designs  prescribed  by  the  model, 
detailed  Verilog  code  for  wormhole  and  virtual  channel 
routers  were  crafted  and  simulations  carried  out  to  deter¬ 
mine  their  latency-throughput  characteristics.  A  typical 


4.  Due  to  space  constraint,  detailed  gate-level  designs  of  the  routers  can¬ 
not  be  presented.  In  brief,  separable  allocators  are  used,  and  the  arbiter 
adopted  is  the  matrix  arbiter. 


5.  T4  refers  to  the  delay  of  an  inverter  driving  4  other  inverters  [4]. 

6.  If  the  virtual-channel  router  is  speculative,  the  per-hop  latency  can 
potentially  be  the  same  as  that  of  the  wormhole  router. 
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Figure  7.  Latency-throughput  curves  of  wormhole  and 
virtual-channel  routers  with  8  buffers  per  input  port.  The 
routers  adhere  to  the  proposed  pipelined  router  model. 

clock  cycle  of  20  T4  was  assumed.  The  simulator  generates 
uniformly  distributed  traffic  across  a  8-by-8  mesh  network 
to  random  destinations.  Each  simulation  is  run  for  a  warm¬ 
up  phase  of  10,000  cycles.  Thereafter,  100,000  packets  are 
injected  and  the  simulation  is  run  till  these  packets  in  the 
sample  space  have  all  been  received.  A  constant  rate 
source  injects  5-flit  packets  at  a  percentage  of  the  capacity 
of  the  network  and  the  average  latency  of  the  packets  is 
calculated.  Latency  spans  the  instant  when  the  first  flit  of 
the  packet  is  created,  to  the  time  when  its  last  flit  is  ejected 
at  the  destination  node,  including  source  queuing  time  and 
assuming  immediate  ejection.  Each  router  uses  credit- 
based  flow  control  to  regulate  the  use  of  buffers,  and  prop¬ 
agation  delay  across  the  channel  is  assumed  to  take  a  single 
cycle.  Since  the  purpose  of  our  simulations  is  to  explore 
the  performance  of  flow  control  strategies,  we  chose  sim¬ 
ple  dimension-ordered  routing. 

4,1  Performance  of  wormhole  vs*  virtual*channel 
routers 

Figure?  shows  the  latency-throughput  curves  of  a 
wormhole  router  and  a  virtual-channel  router  with  8  buff¬ 
ers  per  input  port.  The  base  latency  of  the  wormhole  router 
(29  cycles)  is  lower  than  that  of  the  basic  virtual-channel 
router  (36  cycles),  since  the  wormhole  router  uses  a  3-stage 
pipeline  while  the  virtual-channel  router  is  pipelined  over  4 
stages  at  each  hop.  However,  virtual-channel  flow  control 
with  2  virtual  channels  extends  the  throughput  achieved  by 
wormhole  flow  control  by  25%  from  40%  capacity  to  50% 
capacity. 

With  16  buffers  per  input  port,  similar  effects  were 
observed,  as  shown  in  Figure  8.  Again,  a  basic  virtual- 
channel  router  has  a  higher  base  latency  (35  cycles)  as 
compared  to  that  of  a  wormhole  router,  due  to  the  addi- 
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Figure  8.  Latency-throughput  curves  of  wormhole  and 
virtual-channel  routers  with  16  buffers  per  input  port.  The 
routers  are  pipelined  as  prescribed  by  the  proposed 
pipelined  router  model. 
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Figure  9.  Performance  of  wormhole  and  virtual  channel 
routers,  as  modelled  by  the  proposed  pipelined  delay 
model,  and  as  modelled  assuming  a  single-cycle  router 
delay.  (8  buffers  per  input  port) 

tional  pipeline  stage  per  hop.  However,  the  advancement  in 
throughput  is  large,  as  a  virtual-channel  router  with  2  vir¬ 
tual  channels  enjoys  a  throughput  of  65%  capacity,  a  30% 
improvement  over  that  achieved  by  the  wormhole  router 
(50%  capacity).  With  4  virtual  channels,  throughput  is  fur¬ 
ther  extended  to  70%  capacity,  a  40%  improvement  over 
wormhole  flow  control. 

4.2  Effect  of  assuming  single-cycle  router  latency 
vs.  multiple-cycle  pipelined  design 

Most  published  research  compares  the  performance  of 
different  router  designs  assuming  a  single-cycle  router 
latency,  without  taking  into  account  implementation  com¬ 
plexity  and  cost.  To  quantify  this  effect,  we  ran  simulations 
with  a  cycle-accurate  C  simulator  which  assumes  single- 


cycle  router  latency  for  both  wormhole  and  virtual-channel 
flow  control.  All  other  experimental  parameters  are  identi¬ 
cal  to  that  of  the  Verilog  simulator. 

As  shown  in  Figure  9,  assuming  single-cycle  router 
latency  results  in  both  wormhole  and  virtual-channel  rout¬ 
ers  incurring  the  same  low  base  latency  of  16  cycles, 
whereas  simulations  which  adhere  to  our  proposed  multi¬ 
ple-cycle  pipeline  design  highlight  the  higher  base  latency 
incurred  by  virtual-channel  flow  control  due  to  its  longer 
pipeline. 

It  is  also  apparent  that  assuming  that  a  single-cycle 
router  latency  results  in  inflated  throughput  figures.  This  is 
because  throughput  in  wormhole  and  virtual-channel  flow 
control  is  strongly  influenced  by  buffer  utilization,  which 
depends  on  how  quickly  credits  can  be  sent  and  received, 
prompting  the  re-use  of  buffers.  In  the  simulations  assum¬ 
ing  a  single-cycle  routing  latency,  a  credit  can  be  sent  and 
received  in  2  cycles,  while  in  our  pipelined  model,  a  worm- 
hole  router  needs  4  cycles  to  turnaround  credits,  and  a  vir¬ 
tual-channel  router  needs  5  cycles.  Thus,  throughput  is 
lower  in  the  pipelined  model  than  in  one  which  ignores 
implementation  delay. 

5.  Conclusions 

We  have  presented  a  router  delay  model  which  accu¬ 
rately  accounts  for  pipelining  and  models  routers  with 
canonical  micro-architectures  which  are  tailored  to  its  flow 
control  technique.  Verilog  simulations  based  on  the  model 
show  modest  additional  router  latency  with  virtual-channel 
flow  control  which  is  more  than  offset  by  its  improvements 
in  throughput  over  wormhole  flow  control.  Work  is  cur¬ 
rently  on-going  to  investigate  speculative  virtual-channel 
flow  control,  whch  can  potentially  reduce  the  router 
latency  experienced  by  a  virtual-channel  router  to  that  of  a 
wormhole  router. 


Also,  when  compared  with  simulations  ignoring  pipe¬ 
line  delay  and  assuming  unit  router  latency,  significant  dis¬ 
crepancies  in  base  latency  and  throughput  were  observed, 
supporting  the  importance  of  considering  implementation 
costs  when  simulating  router  performance. 
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